A system whereby speech is used as a data carrier is proposed. The 
speech, sampled at 8 kHz, is divided into blocks of N samples, and 
provided the correlation coefficient and mean square value of the 
samples exceed system thresholds, data is allowed to be transmitted. 
If the data is a logical 0, the samples are sent without modification; 
however, if a logical 1 is present, frequency inversion scrambling of 
the samples occurs. The receiver performs the inverse process to 
recover both the speech and data. Data rates of 700 b/s were achieved 
without data errors or speech distortion via an ideal channel. The 
effects of additive background and channel noise were investigated, 
and the system was shown to operate at 126 b/s with no data errors 
when the additive noise was as high as 10 dB below the mean square 
value of the speech signal. 

I. INTRODUCTION 

There are numerous schemes 1,2 for analog scrambling of speech 
signals, but they all require a scrambling key. For example, we may 
sample the speech at a rate in excess of its Nyquist rate, parcel the 
samples into blocks, and rearrange the blocks prior to transmission. 
This rearrangement of the blocks breaks up the rhythm in the speech 
making it difficult for an eavesdropper to comprehend the conversa- 
tion. The shuffling of the block positions is done under the auspices of 
the scrambling-key, and provided the receiver knows this key and, 
hence, the descrambling key, the blocks of speech can be correctly re- 
positioned and made intelligible to the desired recipient. 

It is not our purpose to describe the numerous scrambling tech- 
niques, but rather to suggest a method whereby speech and data can 
be transmitted simultaneously over the channel by using scrambling 
strategies. The principle is very simple. The scrambling key becomes 
the data to be transmitted. The receiver adopts the role of code- 
breaker. Every time the receiver correctly guesses the key and breaks 
the code, it recovers both the speech and the data. For the scheme to 
have any significance, the receiver must break the code successfully at 
nearly every attempt. Therefore, we must select scrambling keys which 
are easy to break, and this means that we are not aiming for speech 
privacy (although a degree of privacy may be achieved as a by- 
product). The scrambling process is, therefore, a catalyst which enables 
the data to be transmitted. 

At first sight, it might appear that we are getting something for 
nothing. With care we can arrange for the data to be transmitted at 
negligible error rate, the speech faithfully recovered, and a small 
bandwidth expansion of the transmitted signal compared to the origi- 
nal speech. These rewards are derived from the inherent redundancy 
in the speech signal. Indeed, we emphasize that the method will work 
with any signal that has correlative features, such as speech, television, 
facsimile, and analog-plant control signals, like pressure and temper- 
ature variations, etc. 

II. SIMULTANEOUS SPEECH AND DATA TRANSMISSION USING 
FREQUENCY INVERSION SCRAMBLING 

As a demonstration of the concept, we describe the transmission of 
data using the simplest of scrambling methods, frequency inversion. 
In this method, speech, band-limited to 3.4 kHz, is sampled at 8 kHz 
and N samples are processed at a time. Let us represent these samples 
as 

$$S_1 = x_0, x_1, x_2, \ldots, x_{N-1}. \tag{1}$$

To invert the frequency components associated with these N samples, 
all we need to do is to alter the polarity of every other sample, 3,4 
namely, 

$$S_2 = x_0, -x_1, x_2, -x_3, \ldots, -x_{N-1}, \quad N \text{ even}. \tag{2}$$

In frequency-inversion scrambling (FIS), sequence S_2 would always 
be transmitted, but in our scheme, sequence S_2 is used only when we 
decide to transmit data and, further, the data is a logical 1. Observe 
that only one bit per N speech samples is transmitted. 

To minimize the number of bits received in error, we proceed as 
follows. The calculation 

$$\rho = \frac{\sum_{i=0}^{N-2} x_i x_{i+1}}{\sum_{i=0}^{N-1} x_i^2} \tag{3}$$



is made on the original speech sequence S_1 and called here the 
correlation coefficient, and the mean square value 

$$\sigma_x^2 = \frac{1}{N} \sum_{i=0}^{N-1} x_i^2 \tag{4}$$



in the block of speech samples is also found. Notice that the correlation 
coefficient ρ_s of the scrambled sequence S_2 is −ρ. Figure 1 shows the 
block diagram of the system. Mean square value σ_x² and correlation 
coefficient ρ are compared with system threshold parameters T and K 
in comparators COMP 1 and COMP 2, respectively. Parameters T and 
K may be selected such that σ_x² > T and ρ > K generally implies the 
absence of unvoiced speech and silence, assuming there is no additive 
background noise. This strategy aids in reducing the number of 
received bit errors when transmitting through noisy channels. Later we 
will give details of how T and K are selected. 
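
As an illustration only, the following minimal Python sketch evaluates eqs. (3) and (4) on one block and checks that negating every other sample reverses the sign of the correlation coefficient while leaving the mean square value unchanged. The sample values are made up, and the function names and the handling of an all-zero block are assumptions, not part of the described system.

```python
def invert(block):
    """Negate every other sample, as in eq. (2)."""
    return [x if i % 2 == 0 else -x for i, x in enumerate(block)]


def correlation(block):
    """Correlation coefficient of eq. (3)."""
    energy = sum(x * x for x in block)
    if energy == 0:
        return 0.0  # assumption: an all-zero block is treated as uncorrelated
    return sum(a * b for a, b in zip(block, block[1:])) / energy


def mean_square(block):
    """Mean square value of eq. (4)."""
    return sum(x * x for x in block) / len(block)


# Hypothetical block of N = 8 smoothly varying samples.
block = [100, 180, 220, 200, 130, 40, -60, -140]
rho = correlation(block)
rho_s = correlation(invert(block))
print(f"rho = {rho:.3f}, rho_s = {rho_s:.3f}")
print(f"mean square: {mean_square(block):.1f} vs {mean_square(invert(block)):.1f}")
```

Because the inversion only changes signs, the denominator of eq. (3) is untouched and every adjacent product changes sign, which is why ρ_s = −ρ holds exactly.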
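Data is only transmitted when the Boolean equation 

$$y = C_1 C_2 \tag{5}$$

is a logical 1, where 

$$C_1 = \begin{cases} \text{logical 1} & \text{if } \sigma_x^2 \ge T \\ \text{logical 0} & \text{if } \sigma_x^2 < T \end{cases}$$

and 

$$C_2 = \begin{cases} \text{logical 1} & \text{if } \rho \ge K \\ \text{logical 0} & \text{if } \rho < K. \end{cases}$$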
Fig. 1. Block diagram of the SSDT/FIS system at the transmitting end for the 
simultaneous transmission of speech and data. 
The data sequence is allowed to select S_1 or S_2 if eq. (5) is satisfied. 
Thus, if y = 1, the switch in Fig. 1 is set in position A or B if the data 
is logical 0 or 1, respectively, i.e., a sequence S_t is generated according 
to 

$$S_t = \begin{cases} S_1 & \text{data = logical 0} \\ S_2 & \text{data = logical 1.} \end{cases} \tag{6}$$

Whenever y = 0, S_t = S_1, the unscrambled speech. The sequence S_t is 
appropriately filtered and transmitted as the combined speech and 
data signal. 
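
The transmitter logic of eqs. (5) and (6) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names `gate` and `transmit_block` are hypothetical, the output filtering is omitted, and the default thresholds T = 0, K = 0.6 are simply the values quoted for Fig. 2.

```python
def invert(block):
    """Negate every other sample, as in eq. (2)."""
    return [x if i % 2 == 0 else -x for i, x in enumerate(block)]


def correlation(block):
    """Correlation coefficient of eq. (3)."""
    energy = sum(x * x for x in block)
    if energy == 0:
        return 0.0  # assumption: an all-zero block is treated as uncorrelated
    return sum(a * b for a, b in zip(block, block[1:])) / energy


def mean_square(block):
    """Mean square value of eq. (4)."""
    return sum(x * x for x in block) / len(block)


def gate(block, T, K):
    """Boolean y of eq. (5): both thresholds must be met."""
    return mean_square(block) >= T and correlation(block) >= K


def transmit_block(block, bit, T=0.0, K=0.6):
    """Return (S_t, carried), where carried says whether the bit was sent."""
    if not gate(block, T, K):
        return list(block), False      # y = 0: unscrambled speech, no data
    if bit == 1:
        return invert(block), True     # switch position B: send S_2
    return list(block), True           # switch position A: send S_1


voiced = [100, 180, 220, 200, 130, 40, -60, -140]   # made-up, high correlation
rough = [100, -120, 90, -30, 60, -80, 20, -10]      # made-up, negative correlation
print(transmit_block(voiced, 1))
print(transmit_block(rough, 1))
```

A caller would offer the next pending data bit to each block and advance through the data only when `carried` is true, which is what makes the bit rate depend on the speech.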

To illustrate the effect of the imposition of data on the speech signal, 
we show the waveforms in Fig. 2. In (a) and (b) of Figure 2 an arbitrary 
segment of speech and the corresponding transmitted signal containing 
data for 120 blocks are shown, respectively. The envelope of the signal 
is barely changed, and blocks conveying zeros are not scrambled. 
Hence, the transmitted signal is perceived as a distorted version of the 
input speech — intelligible but tiresome to a listener. A smaller segment 
of the original speech signal, and the resulting transmitted signal for 
the logical values of the data shown, are displayed, respectively, in (c) 
and (d) of Fig. 2. There are now only 24 blocks, and the frequency 
inversions are apparent when the data is a logical 1. 








Fig. 2 — Arbitrary segments of speech are shown in (a) and (c), and the corresponding 
transmitted signals are displayed in (b) and (d), respectively, N = 8, T = 0, K = 0.6. The 
logical values of the data signal are shown for the transmitted signal (d). 
The signal emerging from the transmission channel is sampled at 8 
kHz to give Ŝ_t, where a caret (^) above the symbol signifies its 
presence at the receiver. In the absence of channel impairments, 
Ŝ_t = S_t. The power σ̂_x² and correlation coefficient ρ̂ of the 
sequence Ŝ_t in the block of N samples are computed according to 
eqs. (3) and (4). The operations associated with eq. (5) are 
implemented, and the following processes are performed until a 
decision is reached. 

(i) If y is a logical 1, data is assumed to be transmitted of value 
logical 0, and Ŝ_t = S_1 is the recovered speech sequence. 

(ii) If y is a logical 0, data may or may not be present. To determine 
whether data is present, every other sample in Ŝ_t is inverted and the 
scrambled correlation coefficient ρ_s is computed. Then, 

(a) if y remains a logical 0, it is decided that no data was sent. 
The recovered speech sequence is, therefore, the original received 
sequence Ŝ_t. 

(b) if y becomes a logical 1, it is decided that data is present of 
value logical 1, and the recovered speech sequence is the scrambled Ŝ_t 
sequence. 

Observe that if the conditions are not correct for the conveyance of 
data, or if a logical 0 is transmitted, the speech is dispatched without 
being scrambled. Only when a logical 1 is transmitted is scrambling 
performed, and this is done twice, once at the transmitter and once at 
the receiver. Should a data error occur, the speech at the output of the 
receiver may be erroneously scrambled. The resulting error samples in 
the block of length N have a rate of 4 kHz, and magnitudes double 
that of the original speech samples. 
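
The receiver decisions (i), (ii)(a), and (ii)(b) can be sketched as below, together with a round trip over an ideal channel. This is an illustrative sketch, not the authors' implementation: the function names, the use of `None` for "no data", and the sample blocks are assumptions, and block synchronization is taken for granted.

```python
def invert(block):
    """Negate every other sample, as in eq. (2)."""
    return [x if i % 2 == 0 else -x for i, x in enumerate(block)]


def correlation(block):
    """Correlation coefficient of eq. (3)."""
    energy = sum(x * x for x in block)
    if energy == 0:
        return 0.0  # assumption: an all-zero block is treated as uncorrelated
    return sum(a * b for a, b in zip(block, block[1:])) / energy


def gate(block, T, K):
    """Boolean y of eq. (5), with the mean square value of eq. (4)."""
    return sum(x * x for x in block) / len(block) >= T and correlation(block) >= K


def transmit_block(block, bit, T=0.0, K=0.6):
    """Transmitter of eq. (6): scramble only when y = 1 and the bit is 1."""
    if gate(block, T, K) and bit == 1:
        return invert(block)
    return list(block)


def receive_block(received, T=0.0, K=0.6):
    """Return (bit, speech); bit is 0, 1, or None when no data is detected."""
    if gate(received, T, K):
        return 0, list(received)          # step (i)
    descrambled = invert(received)
    if gate(descrambled, T, K):
        return 1, descrambled             # step (ii)(b)
    return None, list(received)           # step (ii)(a)


voiced = [100, 180, 220, 200, 130, 40, -60, -140]   # made-up, rho about 0.80
weak = [100, 20, -90, 60, -10, 80, -70, 30]         # made-up, |rho| below K
for block, bit in ((voiced, 0), (voiced, 1), (weak, 1)):
    print(bit, receive_block(transmit_block(block, bit)))
```

In all three cases the recovered speech equals the input block; only the weakly correlated block fails to carry its bit.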



III. PERFORMANCE PARAMETERS FOR DATA TRANSMISSION 

From a data transmission point of view we are interested in the 
transmitted bit rate (TBR) and the total bit error rate (TBER). Data 
will only be transmitted when y of eq. (5) is a logical 1, and the 
efficiency η of the system to transmit data is given by 

$$\eta = \frac{\text{actual data rate}}{\text{possible data rate}} \tag{7}$$

from which 

$$\text{TBR} = \frac{\eta f_s}{N}, \tag{8}$$

where f_s is the sampling rate of the speech signal. Error bits are those 
bits generated incorrectly at the receiver, and the number of bit errors 
per second is the TBER. Let the measure of the deficiency of the system 
that results in erroneous data at the output of the receiver be known 
SPEECH AND DATA TRANSMISSION 2085 



as the data transmission deficiency, 

$$\lambda = \frac{\text{data error rate}}{\text{possible data rate}}. \tag{9}$$

Then 

$$\text{TBER} = \frac{\lambda f_s}{N} \tag{10}$$



or 

TBER = BER + FBR, (11) 

where BER is the conventional bit-error rate that relates to those bits 
transmitted that were erroneously received. The term FBR is the 
false-bit rate that is associated with the generation of bits at the receiver 
when none were actually transmitted, and the declaration at the 
receiver that no bits were transmitted when they really were. 
Representing the states when the transmitter does not transmit data, and 
when the receiver deems that no data was transmitted, by the symbol 
−1, and using the logical data symbols of 1 and 0, we are able to 
construct Table I, which shows all the possible data-error conditions. 

Let us consider the case of no additive noise to the speech input 
signal, and an ideal channel. In this case, the false bits are always a 
logical 1 and occur when no data (−1) was transmitted. This is state A 
in Table I. These errors occur when the power in the block is above 
the threshold, σ_x² > T, and the correlation ρ is below its threshold, 
ρ < K, prohibiting transmission of data. Now K is a positive number, 
and the bit error will occur if ρ is negative having a magnitude K_1, say, 
that is greater than K. At the receiver, Ŝ_t = S_1, y = logical 0 and, 
hence, the received sequence is scrambled. Because the correlation 



Table I. Data error table and output speech status 

Note: Logical states of the data are represented by 1 and 0. When no 
data is sent, or no data received, −1 is used. When the output speech 
at the receiver for the block of N samples is correct, the symbol C is 
used; when it is scrambled, i.e., frequency inverted, I is used. 

2086 THE BELL SYSTEM TECHNICAL JOURNAL, NOVEMBER 1 981 



coefficient of the scrambled sequence is ρ_s = −ρ = +K_1, and K_1 > K, 
y is now a logical 1, and data is deemed to be present having a value 
logical 1. Thus, the probability of a false bit being generated is very 
low, being the joint probability that σ_x² > T and ρ < −K. 
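
The false-bit mechanism described above (state A) can be reproduced with a short Python sketch. It is illustrative only: the block is a made-up sequence with strongly negative correlation, and the helper names are hypothetical.

```python
def invert(block):
    """Negate every other sample, as in eq. (2)."""
    return [x if i % 2 == 0 else -x for i, x in enumerate(block)]


def correlation(block):
    """Correlation coefficient of eq. (3); assumes a non-zero block."""
    return sum(a * b for a, b in zip(block, block[1:])) / sum(x * x for x in block)


def gate(block, T, K):
    """Boolean y of eq. (5)."""
    return sum(x * x for x in block) / len(block) >= T and correlation(block) >= K


def receive_bit(received, T, K):
    """Receiver decision: 0, 1, or None when no data is detected."""
    if gate(received, T, K):
        return 0
    return 1 if gate(invert(received), T, K) else None


# Made-up block whose samples alternate in sign, so rho is about -0.85.
alternating = [50, -60, 55, -40, 30, -45, 60, -50]
rho = correlation(alternating)

# The transmitter sends this block unscrambled with no data, since rho < K.
# With K = 0.6 < |rho| the receiver wrongly reports a logical 1 (state A);
# with K = 0.9 > |rho| it correctly reports that no data was sent.
print(f"rho = {rho:.3f}")
print("K = 0.6:", receive_bit(alternating, T=0.0, K=0.6))
print("K = 0.9:", receive_bit(alternating, T=0.0, K=0.9))
```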

When the speech signal is in a noisy environment, the symbols x_(·) 
representing speech in eqs. (1) to (4) are replaced by x'_(·) = x_(·) + n_(·), 
where n_(·) is the noise component and the prime (') on the symbols means 
noise contamination. The effect of the noise is to increase σ_x'² and 
decrease ρ', and as both σ_x'² and ρ' must exceed their thresholds [see 
eq. (5)] for data to be transmitted, the TBR decreases. Provided the 
channel is ideal, the TBER will depend on the correlative properties of 
the received speech, and the only source of errors derives from state 
A, i.e., TBER = FBR. 

When clean speech is used and the channel is noisy, the TBR is 
unaffected. However, the TBER increases with channel noise power 
because the noise decorrelates the received signal, causing the receiver 
to sometimes erroneously presume that no data was transmitted. Thus, 
states D and F apply for this condition, and as the existence of other 
states occurs with a much lower probability, the received bit rate is 
approximately the difference between TBR and FBR. 

Dispersive channels alter both the power and correlation of the 
recovered signal. The most common state is D, which occurs when 
ρ < |K|. State C occurs when the scrambled speech arrives with a 
correlation ρ > K, causing 1 to be interpreted as a 0. State F occurs 
when ρ < |K|, or σ_x² < T, or when both ρ < |K| and σ_x² < T. The other 
states were found to rarely happen. 

IV. DATA TRANSMISSION PERFORMANCE 

The simultaneous speech and data transmission using frequency 
inversion scrambling, SSDT/FIS, described here, was investigated using 
the sentences "Live wires should be kept covered" and "To reach the 
end he needs much courage," spoken by a male and female, 
respectively. The speech signal was sampled at 8 kHz to yield 38,912 samples, 
a number sufficiently large to give a good indication of the system's 
performance. The amplitude of the speech samples was confined to 
the range extending from −6000 to +6000 arbitrary units, and the 
mean square value of the samples averaged over both sentences was 
MS_x = 1.09 × 10^6, or 60.4 dB relative to a mean square value of unity. 
The time waveforms for these two sentences and an expanded version 
of the magnitude of these speech samples to give the time variation of 
the low-level sounds are shown in Fig. 3. 

Our objectives were to determine how to select K, T, and N for high 
TBR and low or negligible TBER, and to study how the performance 
deteriorated in the presence of additive noise on the input speech and 
on the transmitted SSDT/FIS signal. We assumed that block 
synchronization between transmitter and receiver was correctly maintained at 
all times. 

Fig. 3. Time waveforms for "Live wires should be kept covered," and "To reach the 
end he needs much courage," are shown in (a) and (c). The corresponding positive 
amplitudes of the waveforms for the low-level sounds (high amplitudes truncated), 
together with various values of √T, are displayed in (b) and (d), respectively. 
4.1 Selection of K 

The two sentences were processed sequentially. The speech samples 
were divided into blocks of N samples, where N could be either 8, 16, 
32, 64, 128, or 256, resulting in 4864, 2432, 1216, 608, 304, and 152 blocks 
of samples, respectively. For each value of N the probability density 
function (pdf) was computed for the correlation coefficient ρ, and 
plotted in Fig. 4. The pdfs were found to have similar shapes for N = 
16 to 256, although the shape marginally altered for N = 8. For smaller 
values of N, there is a translation in the position of the pdf to lower 
values of ρ. This arises because of the definition of ρ given by eq. (3). 
The maximum possible value of ρ for N = 4, 3, and 2 is 0.809, 0.707, 
and 0.5, respectively. We will subsequently show that N = 4 is the 
smallest block size of interest in this transmission system; therefore, 
we do not display pdfs in Fig. 4 for N < 4. 
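
The quoted maxima can be checked numerically. The sketch below assumes the closed form cos(π/(N + 1)) for the largest value of eq. (3); this expression is not given in the text but is the standard largest eigenvalue of the tridiagonal form in the numerator, attained by a half-sine block. The function names are hypothetical.

```python
import math


def correlation(block):
    """Correlation coefficient of eq. (3)."""
    return sum(a * b for a, b in zip(block, block[1:])) / sum(x * x for x in block)


def max_correlation(n):
    """Assumed closed form for the largest possible rho with block length n."""
    return math.cos(math.pi / (n + 1))


def maximizing_block(n):
    """Half-sine block that attains the assumed maximum."""
    return [math.sin((i + 1) * math.pi / (n + 1)) for i in range(n)]


for n in (2, 3, 4, 8):
    closed = max_correlation(n)
    attained = correlation(maximizing_block(n))
    print(f"N = {n}: closed form {closed:.3f}, attained {attained:.3f}")
```

For N = 2, 3, and 4 this reproduces 0.5, 0.707, and 0.809, and it shows why the pdfs shift to lower ρ as N shrinks: the numerator of eq. (3) has only N − 1 terms against N in the denominator.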

In the SSDT/FIS system, with the threshold T set to zero, the signal 
used to transmit data is the original speech signal, for which the curves 
in Fig. 4 apply. However, if T > 0, more blocks of speech are rejected 
for the conveyance of data. Therefore, TBR decreases, and the resulting 
blocks available for data transmission have pdfs for the correlation 
coefficient that are different from those in Fig. 4. At this stage, we will 
confine the discussion to the case of T = 0, i.e., where TBR has its 
highest values, and the curves in Fig. 4 are relevant. 
Returning to these curves, we draw attention to their most negative 
correlation coefficient values, ρ_min, as they can have a significant effect 
on the number of bit errors. The variation of ρ_min, and the maximum 
correlation coefficient ρ_max, as a function of N is displayed in Fig. 5. We 
recall from our discussion in Section III that if ρ < K, σ_x² > T, no data 
is transmitted. Assuming an ideal channel, and given that ρ̂ = ρ < −K, 
the system is fooled into believing a logical 1 was transmitted and a bit 
error occurs. Clearly, if K is selected such that ρ < −K does not exist, 
then no bit errors are possible over an ideal channel. To avoid bit 
errors we arrange for 

$$K > |\rho_{\min}|, \tag{12}$$

and the choice of K to avoid bit errors as a function of N must, 
therefore, be above the curve |ρ_min|, e.g., for N = 64, K > 0.43. For 
N < 16, ρ_min and ρ_max both decrease with decreasing N, and for N = 4 
we have the interesting situation that |ρ_min| = ρ_max, which means that 
if inequality (12) is satisfied no data will be transmitted, as ρ > K 
cannot exist. The value N = 4, therefore, marks the lower limit of the 
block size for combined speech and data transmission over an ideal 
channel without the occurrence of bit errors. 

Reducing K from ρ_max increases the number of speech blocks that 
can be considered for the conveyance of binary data, but if K < |ρ_min|, 
bit errors ensue. Thus, in order to transmit the greatest amount of 
data without errors over an ideal channel, K is bounded by 

$$|\rho_{\min}| < K < \rho_{\max}. \tag{13}$$

However, the negative tails of the pdfs are long and of low amplitude 
and, hence, K < |ρ_min| can be used provided the penalty of a low TBER 
can be tolerated. 

Fig. 5. Variation of maximum (ρ_max) and minimum (ρ_min) correlation coefficient 
values for different block sizes (N). 

Fig. 6. Variation of data transmission efficiency (η) and data transmission deficiency 
(λ), as a function of K and T for N = 8. 
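
A possible way to obtain the bounds of eq. (13) from recorded samples is sketched below in Python. The test signal (a 200-Hz tone followed by a burst of alternating samples) is synthetic and purely illustrative, and the function names are assumptions; real speech would give different limits.

```python
import math


def correlation(block):
    """Correlation coefficient of eq. (3)."""
    return sum(a * b for a, b in zip(block, block[1:])) / sum(x * x for x in block)


def correlation_extremes(samples, n):
    """Return (rho_min, rho_max) over consecutive blocks of n samples."""
    rhos = []
    for start in range(0, len(samples) - n + 1, n):
        block = samples[start:start + n]
        if any(block):  # assumption: skip all-zero blocks, where rho is undefined
            rhos.append(correlation(block))
    return min(rhos), max(rhos)


# Synthetic stand-in for speech: 64 samples of a 200-Hz tone at 8 kHz,
# followed by 8 samples that alternate in sign.
tone = [round(1000 * math.sin(2 * math.pi * 200 * i / 8000)) for i in range(64)]
burst = [50, -60, 55, -40, 30, -45, 60, -50]
rho_min, rho_max = correlation_extremes(tone + burst, 8)
k_low, k_high = abs(rho_min), rho_max
print(f"error-free operation needs {k_low:.3f} < K < {k_high:.3f}")
```

If `k_low` were not smaller than `k_high`, no error-free choice of K would exist for that block size, which is the situation the text describes for N = 4.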

4.2 Ideal channel 

The effects of parameters K and T on the data transmission 
efficiency η, the TBR, the data transmission deficiency λ, and the 
TBER, for block sizes of 8 and 32, are shown in Figs. 6 and 7, respectively. 
These two block sizes were selected because N = 8 provides the highest 
data transmission rate in the absence of false bits, and N = 32 has 
fewer false bits than does N = 8 at low values of K, while having a 
relatively high TBR. Because the shapes of the curves in Figs. 6 and 7 
are similar, we refrain from showing curves for other values of N. 

The curve for T = 0 is of interest as it provides the highest values of 
η, and relates to the discussion in Section 4.1. If K is increased beyond 
0.8 the curve falls rapidly, becoming zero for K > ρ_max. As K is reduced 
below 0.2, η climbs towards 100 percent, but λ becomes excessive. 
Consider the operating condition: T = 0 and K = |ρ_min|. For N = 8, 
K = 0.6, η = 71 percent, λ = 0, yielding a TBR of 710 b/s and a TBER of 
zero. By reducing K to 0.2 while maintaining T = 0, η is increased to 
90 percent, or to a TBR of 898 b/s. However, λ is now 3.35 percent, 
giving a TBER = FBR of 33.5 b/s. These false bits arise because K is below 
|ρ_min|. The system can operate with low values of K, 0.2 say, provided 
T is increased. By raising the value of T, blocks which occur during 
silence and unvoiced periods are not considered for data transmission. 
For still higher values of T, blocks existing during silence, unvoiced, 
and low amplitude voiced sounds are rejected for the conveyance of 
data. The values of T used in our experiments (other than T = 0), 
namely 50, 100, 10^3, 10^4, 4.5 × 10^4, and 2 × 10^5, correspond to power levels 
that are 43.4, 40.4, 30.4, 20.3, 13.8, and 7.36 dB, respectively, below the 
mean square value MS_x of the combined speech signals. The √T 
thresholds, except √50, are shown in Fig. 3(b) and (d), and for reference 
(2 × 10^5)^(1/2) is shown in Fig. 3(a) and (c). 

Fig. 7. Variation of data transmission efficiency (η) and data transmission deficiency 
(λ), as a function of K and T for N = 32. 
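
The decibel figures quoted for the thresholds follow from 10 log10(MS_x / T). A short Python check, using the rounded value MS_x = 1.09 × 10^6 given earlier (so agreement is only expected to within about 0.1 dB), might look like this; the function name is hypothetical.

```python
import math

MS_X = 1.09e6  # mean square value of the speech, as quoted in the text
THRESHOLDS = [50, 100, 1e3, 1e4, 4.5e4, 2e5]


def db_below_speech(level, ms_x=MS_X):
    """Decibels by which a power level lies below the speech mean square value."""
    return 10 * math.log10(ms_x / level)


for t in THRESHOLDS:
    print(f"T = {t:g}: {db_below_speech(t):.2f} dB below MS_x")
```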

The effect of using non-zero values of T is a modification of the 
shape of the pdf of the correlation coefficient ρ, specifically the 
truncation of its long negative tail. Consequently, ρ_min, a negative value 
in Fig. 3, is made significantly more positive. For example, when N = 
8, ρ_min = −0.6, −0.56, and −0.07 for T = 0, 10^3, and 2 × 10^5, respectively. 
When T = 2 × 10^5, no false bits occur, irrespective of K, as shown in 
Fig. 6, but η decreases significantly to 48 percent, giving a TBR of 480 
b/s. Clearly, for the ideal channel, there is no advantage in making T 
anything other than zero and K = 0.6. We will find that in the presence 
of channel noise T must have a high value if λ is to be contained. 

Although increasing N generally produces higher values of η, as can 
be seen in Fig. 7, the larger block size results in a significant reduction 
in TBR. The effect of N on λ is seen to be small; therefore, we 
recommend the use of N = 8. 

The effect of block size N on TBR and TBER for different values of T 
is shown in Fig. 8, where K = 0.5. The data transmission efficiency is 
approximately independent of N for a given T, and consequently TBR 
is inversely proportional to N [see eq. (8)]. False-bit errors occur for 
N = 16 and 8 as ρ_min < −0.5, unless T is increased to 4.5 × 10^4. By 
using this high value of T, TBR is seen to fall from 530 b/s to 19.5 b/s 
as N is increased from N = 8 to 256, the TBER being maintained at 
zero. Clearly, from a data transmission point of view, high values of N 
are undesirable, although they do increase the listening fatigue of an 
eavesdropper, as described in Section 5.2. 

The small block size of N = 4 that spans a duration of 0.5 ms can be 
used to transmit data without error, provided a large value of T is used 
to remove the long tail in the correlation coefficient pdf of Fig. 4. We 
found that if T = 2 × 10^5, λ = 0, and η = 31.6 percent. This represents 
a TBR of 632 b/s and a TBER of zero. 
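
The bit rates quoted in this section follow directly from eqs. (8) and (10) with f_s = 8 kHz. The Python sketch below reworks a few of them; the function name is hypothetical, and small differences (for example 900 b/s against the quoted 898 b/s) presumably reflect rounding of η in the text.

```python
FS = 8000  # sampling rate in Hz


def rate_bps(fraction, n, fs=FS):
    """Eq. (8) with fraction = efficiency, or eq. (10) with fraction = deficiency."""
    return fraction * fs / n


cases = [
    ("TBR,  N = 8, efficiency 71%", 0.71, 8),
    ("TBR,  N = 8, efficiency 90%", 0.90, 8),
    ("TBER, N = 8, deficiency 3.35%", 0.0335, 8),
    ("TBR,  N = 8, efficiency 48%", 0.48, 8),
    ("TBR,  N = 4, efficiency 31.6%", 0.316, 4),
]
for label, fraction, n in cases:
    print(f"{label}: {rate_bps(fraction, n):.1f} b/s")
```

The same function shows the cost of large blocks: at a fixed efficiency, doubling N halves the bit rate.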

4.2.1 Effect of background noise 

To simulate a noisy environment, we added a random noise sequence 
having a power σ_ni² to the speech sequence. The effect of this 
background noise power on the data transmission efficiency η and TBR for 
different values of threshold T is shown in Figs. 9 and 10 for N = 8, 
K = 0.6, and N = 32, K = 0.5, respectively. An additive power level of 
σ_ni² = 10^k, k = 0, 1, 2, ..., corresponds to a power level (60.4 − 10k) dB 
below the mean square value MS_x of the speech signal. As expected, 
the highest value of η occurs when T = 0, as the only criterion applied 
in the selection of blocks to convey data is based on whether the 
correlation coefficient ρ is above K. Increasing σ_ni² causes the speech to 
be decorrelated; therefore, fewer blocks have ρ > K, and consequently 
η decreases. Increasing T means that fewer blocks fulfill the condition 
that y of eq. (5) is a logical 1. The blocks discarded are generally those 
containing low-level speech, and it is these blocks that experience 
greatest decorrelation. Thus, as σ_ni² increases, η remains constant as 
the decorrelative effect is masked by the value of T. When σ_ni² 
approaches T, blocks not rejected because their mean square value is > T are 
now abandoned because of ρ being too small due to the decorrelation. 
Consequently, η versus σ_ni² is no longer a constant, and η coalesces with 
the T = 0 curve as σ_ni² is further increased. This occurs because the 
controlling factor in block rejection is now the correlation criterion. 
For σ_ni² > T, η decreases at approximately 6.8 percent per decade 
increase in σ_ni² when N is 32, and at a rate approaching this value when 
N is 8. 

Fig. 8. Variation of data transmission efficiency (η) and data transmission deficiency 
(λ), as a function of block size (N) for different values of T, K = 0.5. 

Fig. 10. Effect of additive background noise power (σ_ni²) on data transmission 
efficiency (η) and data transmission deficiency (λ), as a function of threshold T. K = 0.5, 
N = 32, λ = 0. 

The data transmission deficiency λ and the TBER were found to be 
zero for the case of N = 32, over the range of power levels shown in 
Fig. 10. However, when the block size was 8, data errors occurred. The 
variation of λ and TBER with σ_ni² for different values of T is shown in 
Fig. 9. The figure demonstrates that data errors can be avoided for 
σ_ni² < 10^4 by setting T to 2 × 10^5, although the data transmission 
efficiency falls to 43.5 percent, i.e., TBR = 435 b/s. 

4.3 Noisy channel 

When no noise was added to the speech signal, but the channel was 
noisy with additive channel noise power σ_nc², η was unaffected. 
However, λ increased due to blocks that did not contain data (σ_x² < T) but 
had their power increased to σ_x² + σ_nc² ≥ T, and if ρ or ρ_s exceeded K, 
data errors ensued. The variation of λ and TBER with σ_nc² for various 
values of T is displayed in Figs. 11 and 12, for N = 8 and 32, 
respectively. As λ and TBER had zero values for large values of T when 
N = 32, we present a zero line in Fig. 12, and the lines from this base 
to the other points on the curve are dotted. Notice in Fig. 12 that no 
data errors were recorded over the entire range of σ_nc² when T = 2 × 
10^5, and from Fig. 10 this value of T corresponded to η = 50.5 
percent. Thus, by using N = 32, and a background noise power and 
additive channel noise power up to 10^5, i.e., up to 10.4 dB below MS_x, 
we found that 126 b/s can be transmitted without error. When N = 8, 
T = 2 × 10^5, and both types of noise are present up to 10^3, TBR = 435 
b/s, but TBER is no longer zero, having a value between 0.2 and 3.5 b/s. 
The various combinations of TBR and TBER can be deduced from Figs. 
9 through 12. 

V. SPEECH TRANSMISSION PERFORMANCE 

Emphasis has been given to data transmission because we wanted 
to investigate if it could be reliably achieved using speech as a carrier 
signal. In the previous section, we presented results showing that it 
was possible to transmit data without transmission errors, and conse- 
quently the recovered speech signal was unimpaired by conveying the 
data. However, we have also observed that the data rate can be 
increased if bit errors can be tolerated. Thus, we now address the 
problem of how the bit errors affect the recovered speech signal, and 
specifically ask, How serious is the degradation of speech quality and 
intelligibility when the TBER approaches the maximum values 
found in our experiments? 




Fig. 11. Effect of additive channel noise power (σ_nc²) on data transmission efficiency 
(η) and data transmission deficiency (λ), as a function of threshold T. K = 0.6, N = 8. 

Fig. 12. Effect of additive channel noise power (σ_nc²) on data transmission efficiency 
(η) and data transmission deficiency (λ), as a function of threshold T. K = 0.5, N = 32, 
η = 63.7. 



5.1 Objective measure 

To provide an objective measure of the degradation of the recovered 
speech signal that accrues solely from the effect of data errors and not 
from the presence of additive background or channel noise, we elected 
to use segmental s/n (seg-s/n). This ratio is used 5,6 as an objective 
measure because its value corresponds more closely to perceived 
quality than those of conventional s/n measurements, i.e., those using 
the ratio of mean square signal power to mean square noise power, 
determined over the duration of the entire signal. The reason for this 
resides in the computation of seg-s/n, which is performed as follows. 
The input speech sequence {x_k} is divided into contiguous blocks of 
128 samples, i.e., into periods of 16 ms. Only those blocks, M say, 
whose rms value exceeds τ dB (here −60 dB) relative to the peak 
value have their s/n calculated. The s/n for the jth block is 

$$\text{s/n}_j = 10 \log_{10} \left[ \frac{\sum_{i=1}^{128} x_{128j+i}^2}{\sum_{i=1}^{128} \left( x_{128j+i} - \hat{x}_{128j+i} \right)^2} \right], \quad j = 1, 2, \ldots, M, \tag{14}$$




where {x̂_k} is the recovered speech sequence. There is no point in 
considering s/n_j < −10 dB, or > +80 dB, because the speech quality is 
not perceived worse or better than −10 dB or +80 dB, respectively. 
Hence, 

$$\text{s/n}_j = -10 \text{ dB} \quad \text{for } \text{s/n}_j < -10 \text{ dB}$$

and 

$$\text{s/n}_j = +80 \text{ dB} \quad \text{for } \text{s/n}_j \ge +80 \text{ dB} \tag{15}$$

for j = 1, 2, ..., M. The number of data blocks of length N contained 
in the s/n_j calculation block size of 128 decreases from 16 to 0.5 as N 
is increased from 8 to 256, respectively. For a given TBER, the effect of 
increasing N is to cause the number of blocks not having the maximum 
s/n_j of 80 dB to decrease, but the decrease in s/n_j is more substantial 
in those blocks of 128 samples where erroneous scrambling occurred 
with all the samples, compared to those blocks where only 8 samples 
were erroneously scrambled. 

The segmental s/n is computed as the average of s/n_j, j = 1, 
2, ..., M, namely, 

$$\text{seg-s/n} = \frac{1}{M} \sum_{j=1}^{M} \text{s/n}_j. \tag{16}$$

The τ threshold enables us to ignore blocks of low-level speech in the 
calculation of seg-s/n. Even if these blocks are erroneously scrambled 
at the receiver output, their removal from the calculation is justified 
on the basis that the effect on such a low-level sound is imperceptible. 
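
The computation of eqs. (14) to (16) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the "peak value" is taken to be the largest absolute sample of the whole input, an error-free block is assigned the 80-dB ceiling directly, and the function name and test signal are hypothetical.

```python
import math


def seg_snr(x, x_hat, block=128, tau_db=-60.0, floor=-10.0, ceiling=80.0):
    """Segmental s/n in dB, or None if no block exceeds the tau threshold."""
    peak = max(abs(v) for v in x)  # assumption: peak over the whole input
    values = []
    for start in range(0, len(x) - block + 1, block):
        ref = x[start:start + block]
        rec = x_hat[start:start + block]
        signal = sum(v * v for v in ref)
        if peak == 0 or signal == 0:
            continue
        rms = math.sqrt(signal / block)
        if 20 * math.log10(rms / peak) <= tau_db:
            continue  # low-level block, ignored as described in the text
        error = sum((a - b) ** 2 for a, b in zip(ref, rec))
        snr = ceiling if error == 0 else 10 * math.log10(signal / error)
        values.append(min(ceiling, max(floor, snr)))  # limits of eq. (15)
    return sum(values) / len(values) if values else None


# Synthetic 200-Hz tone, 256 samples; the second block of 128 is wrongly inverted.
x = [1000 * math.sin(2 * math.pi * 200 * i / 8000) for i in range(256)]
damaged = x[:128] + [v if i % 2 == 0 else -v for i, v in enumerate(x[128:])]
print(seg_snr(x, x), seg_snr(x, damaged))
```

In this toy case one of the two blocks is fully inverted, so the average drops far below 80 dB; a value close to 80 dB therefore indicates that only a small fraction of the audible 128-sample blocks was affected.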

Let us now consider the case of speech conveying data through an 
ideal channel without the introduction of data errors. Here seg-s/n is 
80 dB, as s/n_j, for all j, is forced to 80 dB. When data errors occur, the 
seg-s/n does not fall greatly below 80 dB, implying that the 
degradation in speech quality because of the presence of data errors is small. 
Perceptual observations substantiate this implication, confirming our 
decision to use seg-s/n as an objective measurement of performance. 

When the input speech is contaminated by statistically independent 
background noise, the recovered speech at the receiver is the sum of 
the original speech and noise signals, provided there are no data errors. 
Whenever a data error occurs, both the speech and the noise in the 
block are scrambled. In the case of additive channel noise, and for no 
data errors, the recovered speech is, again, the sum of the original 
speech and the noise signals, except when the data is a logical 1 when 
the output noise signal is scrambled. The effect of data errors, exclud- 
ing states B and F, is to cause the output signal to be the sum of the 
scrambled original speech signal and the scrambled noise signal. Per- 
ceptually, the scrambled noise signal is the same as the original random 
noise signal; therefore, its effect must be removed in evaluating the 
loss of seg-s/n because of data errors. This is achieved by noting those 
blocks where data errors occurred in a noisy environment, and then 
scrambling the original speech in those same blocks to give the 
sequence {x̂_k} used in eq. (14). By this method, we are able to separate 
the distortion in the output signal, caused by the additive noise, from 
the effect of the data errors that were precipitated by this noise. 

5.1.1 Objective results 

The occurrence of a bit error results in a block of speech samples at 
the output of the receiver being erroneously scrambled. Clearly, if this 
happens to a block containing high amplitude samples, significant 
distortion ensues. From Section IV we have seen that the reduction or 
elimination of data errors can be achieved by increasing T. However, 
this is not our purpose here. We wish to generate data errors and 
observe their effect on the recovered speech. Consequently, in selecting 
conditions to illustrate the reduction in seg-s/n caused by data errors, 
we have opted for T = 0. Table II shows the seg-s/ns for some of the 
worst data-error conditions shown in Figs. 5 to 12, plus a high data- 
error case when N = 256. These conditions were selected to show that 
in spite of the tber values being unacceptably high for most data 
communications systems, the effect of the data errors on speech quality 
is small. Indeed, seg-s/n remains above 66 dB for all the conditions 
depicted in Table II. Therefore, we refrain from presenting detailed 
measurements of speech distortion that is barely perceptible. However, 
we do discuss the error conditions, but cannot make general deductions 
from the few entries in the table, particularly as there is not a consistent 
theme. For example, for the noisy channel there are three values of N, 
but they each have a different K, so comparisons must be tempered 
with caution. 

In Table I we have included the recovered speech status at the 
receiver for each of the data-error states. Two states, B and F, do not 
cause the recovered speech signal to have the samples in the erroneous 
blocks scrambled. When additive channel noise is present, error states 
D and F occur more frequently than the other states. Thus, the 
distortion in the output speech results mainly from state D, i.e., when 
a transmitted logical 1 is ignored. As states D and F are likely to occur 
with approximately the same probability (see Table II), the error rate, 
in terms of distorting the recovered speech, can be considered to be 
reduced by a factor of two. However, state D is associated with a loss 
of a data signal caused by the channel noise increasing the correlation 
of a block of speech samples. Because data was transmitted, the speech 
can be voiced (although T = 0 for Table II), in which case the speech 
distortion may be substantial. 

The data errors resulting from the ideal channel, with or without 
additive background noise, cause state A to apply, as shown by the 
examples in Table II. Although the output speech blocks are erro- 
neously scrambled every time a data error occurs, the distortion is 
confined to blocks that often contain unvoiced sounds as the data error 
is the result of no data being transmitted, but a logical 1 being falsely 
generated. 

Waveforms for the worst condition shown in Table II, namely, 
additive channel noise of power 918, N = 8, T = 0, K = 0.6, are displayed 
in Fig. 13. The high amplitude signal levels are seen to be substantially 
unaffected by the data errors whose effects are often immersed in the 
channel noise, and are therefore not perceptibly annoying. 

5.2 Informal listening experiences 

Informal listening tests were performed for the conditions listed in 
Table II. The two recovered sentences of speech, stripped of noise, 
with blocks of speech erroneously scrambled where data errors 
occurred, suffered only minor distortions. For the ideal channel, minor 
distortions, resembling a "sshing" sound, occurred on three occasions 
for the cases of N = 8 and 32. A quiet noise, like additive white noise, 
was perceived for N = 8 when either background or channel noise 
(the worst condition) was present. For the noisy channel condition, 
N = 32 produced the effect of barely perceptible scratches, while N = 
256 yielded the least distortion, where the degradations were reminis- 
cent of barely audible metallic clicks. 

When the noisy output signal containing the effects of data errors 
was compared to the original speech plus additive noise, the effect of 
the data errors was imperceptible in the case of the substantial additive 
input noise power $\sigma_{ni}^2 = 9.18 \times 10^4$, N being 8. The effect of unwanted 
scrambling when the channel was noisy ranged from barely perceptible, 
for N = 256 and 32, to nonannoying crackles when N = 8 and the 
channel-noise power was only 918. 

The conclusion is that for the data-error rates of practical signifi- 
cance, the degradation in speech quality is insignificant. 

The transmitted signal sounded like distorted speech, plus white 
noise for the case of N = 8 and an ideal channel. The effect of additive 
background or channel noise was to reduce the fatiguing effects, as if 
the distortion had been removed from the speech and the background 
noise increased. When N = 32 and the channel was ideal, the distortion 
increased, as this block size corresponds to 4 ms, approximately half a 
pitch period. The speech sounded as if speaking and gargling were 
being performed simultaneously. The act of adding noise marginally 
reduced listening fatigue. The scrambled signal was found to be just 
intelligible when N = 256. 
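
As a quick check of the block durations quoted above, the following sketch converts a block size N into milliseconds at the 8 kHz sampling rate given in the abstract. The bits-per-second figure is only an upper bound, assuming one data bit could ride on every block; the helper names are hypothetical.

```python
FS_HZ = 8000  # sampling rate stated in the paper


def block_duration_ms(n, fs_hz=FS_HZ):
    """Duration of a block of n samples, in milliseconds."""
    return 1000.0 * n / fs_hz


def max_bit_rate(n, fs_hz=FS_HZ):
    """Upper bound on the data rate if every block carried one bit (assumption)."""
    return fs_hz / n


for n in (8, 32, 256):
    print(n, block_duration_ms(n), max_bit_rate(n))
```

For N = 32 this gives the 4 ms quoted in the text; N = 8 and N = 256 correspond to 1 ms and 32 ms, respectively.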

Fig. 13 — Effect of additive channel noise, (a) Original speech, (b) Transmitted signal 
with additive channel noise of power 920. (c) Recovered speech having data errors 
(A = 12.8 percent) and additive noise. N = 8, T = 0, K = 0.6. 



VI. DISCUSSION 

We started by enunciating a principle: that data could be transmitted 
by making it the scrambling key, and casting the receiver in the role 
of code breaker. Every time the receiver guesses the key, it obtains the 
correct data and the correct speech. The speech is made an unwitting 
data carrier, while the data gets a free ride. The implications of this 
concept are considerable. Continuous users, or providers, of telephone 
traffic can, at the expense of additional terminal equipment, surrepti- 
tiously transmit teleprinter data, with the proviso that the bandwidth 
of the speech signal and the block size are appropriate for the channel 
bandwidth. 

To demonstrate the principle, we wanted a scrambling technique 
that was easy to implement, and where the bandwidth of the scrambled 
speech was not larger than that of the original speech. Frequency 
inversion scrambling aptly fulfilled these requisites, where the scram- 
bling is achieved by merely altering the polarity of every other speech 
sample. We have shown that by using this form of scrambling, it is 
possible to transmit speech and data simultaneously, and to receive 
the data without errors and the speech without distortion, even in the 
presence of additive channel and background noise. Provided some 
data errors can be tolerated, the data rate can be substantially in- 
creased as shown in Figs. 6-12. Even at high data-bit rates the 
distortion in the speech was found to be minimal, as the results in 
Table II indicate. 
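
As a minimal sketch of the scrambling step described above, assuming real-valued samples held in a NumPy array (the function name `frequency_invert` is hypothetical), alternating the sample polarity mirrors the spectrum about one quarter of the sampling rate and is its own inverse:

```python
import numpy as np


def frequency_invert(block):
    """Multiply sample n by (-1)**n, i.e. flip the sign of every other sample."""
    block = np.asarray(block, dtype=float)
    signs = np.where(np.arange(block.size) % 2 == 0, 1.0, -1.0)
    return block * signs


fs = 8000                                   # sampling rate used in the paper, Hz
n = np.arange(64)
tone = np.cos(2 * np.pi * 1000 * n / fs)    # 1 kHz test tone (illustrative input)
scrambled = frequency_invert(tone)

# Locate the peak of the magnitude spectrum before and after scrambling.
freqs = np.fft.rfftfreq(n.size, d=1 / fs)
peak_before = freqs[np.argmax(np.abs(np.fft.rfft(tone)))]
peak_after = freqs[np.argmax(np.abs(np.fft.rfft(scrambled)))]
print(peak_before, peak_after)

# Applying the same operation a second time restores the original block.
recovered = frequency_invert(scrambled)
```

The 1 kHz tone moves to 3 kHz, that is, a component at frequency f is mapped to fs/2 - f, so the scrambled signal occupies the same bandwidth as the original.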

We have not presented results for dispersive channels, although we 
did do some experiments. The effect of such channels was to signif- 
icantly alter the pdf of the correlation coefficient of the received signal 
compared to that of the transmitted signal. The power in the blocks of 
speech was also changed by the dispersive properties of the channel. 
As the data detection procedure is based on a measurement of power 
and correlation in a block of N samples, where the correlation is 
usually the most important factor, the dispersive channel results in an 
unacceptably high tber. Thus, in the presence of dispersive channels, 
equalization of the channel must be performed. 
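
To make the dependence on power and correlation concrete, here is a hedged sketch of a per-block decision. The decision rule is an assumption, not the paper's stated receiver: a block is taken to carry data when its mean-square value exceeds T and the magnitude of its lag-1 correlation coefficient exceeds K, and the sign of the correlation selects the bit, since polarity alternation negates the lag-1 correlation. The names `block_stats` and `detect` are hypothetical.

```python
import numpy as np


def block_stats(block):
    """Return (mean-square value, lag-1 correlation coefficient) of a block."""
    x = np.asarray(block, dtype=float)
    energy = np.sum(x ** 2)
    if energy == 0.0:
        return 0.0, 0.0
    rho = np.sum(x[:-1] * x[1:]) / energy
    return energy / x.size, rho


def detect(block, k, t):
    """Assumed receiver rule: None means no data, otherwise the decoded bit."""
    mean_square, rho = block_stats(block)
    if mean_square <= t or abs(rho) <= k:
        return None                # block judged not to carry data
    return 1 if rho < 0 else 0     # negative correlation implies an inverted block
```

Under this rule, anything that perturbs the block power or pulls the correlation coefficient toward zero, as a dispersive channel does, pushes blocks across the thresholds and produces data errors.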

The speech used in our experiments consisted of two sentences whose 
waveforms are displayed in Fig. 3. The results will therefore differ 
when other sentences are used, but not significantly, as the sentences 
used consisted of over thirty-eight thousand samples. The system 
proposed here is for conveying data on speech or short silences. When 
prolonged silences occur, we envisage data being transmitted by con- 
ventional modem techniques. 

The basic principle established, the way forward is to find scrambling 
methods that will be easier to break with certainty, and will operate 
via dispersive channels without the necessity of channel equalization. 